In this notebook you can find some descriptive analysis of the energy consumption of the family and the forecasting of consumption for the future months. Indeed, the goal of the project is to “sell” the submetering devices showing to the customers how they can benefit from the data collected by the devices. So, on the one hand, we show which useful information they can obtain considering “real time data” and agregated data coming from the devices, on the other hand how we can forecast the energy consumption using past data.
# We set the options for the Rmarkdown
knitr::opts_chunk$set( warning = FALSE)
# We load the libraries
if(require("pacman")=="FALSE"){
install.packages("pacman")
}
pacman::p_load(plotly, corrplot, lubridate, hms, Hmisc, imputeTS,rstudioapi,kknn,ggplot2, randomForest, dplyr, shiny, tidyr, ggplot2, caret, forecast)
We look at the dataset structure.
data <- read.delim("household_power_consumption.txt",header = TRUE, sep=";")
head(data)
We put the variables in the right type.
data$Global_reactive_power = suppressWarnings(as.numeric(as.character(data$Global_reactive_power)))
data$Global_active_power = suppressWarnings(as.numeric(as.character(data$Global_active_power)))
data$Sub_metering_1 = suppressWarnings(as.numeric(as.character(data$Sub_metering_1)))
data$Sub_metering_2= suppressWarnings(as.numeric(as.character(data$Sub_metering_2)))
data$Global_intensity = suppressWarnings(as.numeric(as.character(data$Global_intensity)))
data$Voltage = suppressWarnings(as.numeric(as.character(data$Voltage)))
We change the unit of measure of the active power and reactive power in order to be the same of the submetering (watt-hour). We create a column with the date and the time joined. Finally we change the time in order to take in consideration the solar and legal hour.
#Change the unite measure
data$Global_active_power = data$Global_active_power * 1000 / 60
data$Global_reactive_power = data$Global_reactive_power * 1000 / 60
#create a column in the format strptime
data<-cbind(data,paste(data$Date,data$Time), stringsAsFactors=FALSE)
colnames(data)[10] <-"DateTime"
data<- data[,c(ncol(data), 1:(ncol(data)-1))]
data$DateTime <- strptime(data$DateTime, "%d/%m/%Y %H:%M:%S", tz="GMT")
data$Date <- NULL
data$Time <-NULL
#solar and legal hour
a = which(data$DateTime == as.POSIXct("2007-03-25 02:00:00", tz = "GMT") | data$DateTime == as.POSIXct( "2007-10-28 01:59:00", tz = "GMT"))
data[a[1]:a[2],1] = data[a[1]:a[2],1]+3600
a = which(data$DateTime == as.POSIXct("2008-03-30 02:00:00", tz = "GMT") | data$DateTime ==as.POSIXct( "2008-10-26 01:59:00", tz = "GMT"))
data[a[1]:a[2],1] = data[a[1]:a[2],1]+3600
a = which(data$DateTime == as.POSIXct("2009-03-29 02:00:00", tz = "GMT") | data$DateTime ==as.POSIXct( "2009-10-25 01:59:00", tz = "GMT"))
data[a[1]:a[2],1] = data[a[1]:a[2],1]+3600
a = which(data$DateTime == as.POSIXct("2010-03-28 02:00:00", tz = "GMT") | data$DateTime ==as.POSIXct( "2010-10-31 01:59:00", tz = "GMT"))
data[a[1]:a[2],1] = data[a[1]:a[2],1]+3600
# Recreate the date column and reorder the order of the columns.
data$Date <- as.Date(data$DateTime)
data=data[,c(1,9,2:8)]
We look at some summary measures of the variables.
summary(data)
DateTime Date Global_active_power
Min. :2006-12-16 17:24:00 Min. :2006-12-16 Min. : 1.267
1st Qu.:2007-12-12 00:18:30 1st Qu.:2007-12-12 1st Qu.: 5.133
Median :2008-12-06 07:13:00 Median :2008-12-06 Median : 10.033
Mean :2008-12-06 07:48:33 Mean :2008-12-05 Mean : 18.194
3rd Qu.:2009-12-01 14:07:30 3rd Qu.:2009-12-01 3rd Qu.: 25.467
Max. :2010-11-26 21:02:00 Max. :2010-11-26 Max. :185.367
NA's :25979
Global_reactive_power Voltage Global_intensity Sub_metering_1
Min. : 0.000 Min. :223.2 Min. : 0.200 Min. : 0.000
1st Qu.: 0.800 1st Qu.:239.0 1st Qu.: 1.400 1st Qu.: 0.000
Median : 1.667 Median :241.0 Median : 2.600 Median : 0.000
Mean : 2.062 Mean :240.8 Mean : 4.628 Mean : 1.122
3rd Qu.: 3.233 3rd Qu.:242.9 3rd Qu.: 6.400 3rd Qu.: 0.000
Max. :23.167 Max. :254.2 Max. :48.400 Max. :88.000
NA's :25979 NA's :25979 NA's :25979 NA's :25979
Sub_metering_2 Sub_metering_3
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.000 Median : 1.000
Mean : 1.299 Mean : 6.458
3rd Qu.: 1.000 3rd Qu.:17.000
Max. :80.000 Max. :31.000
NA's :25979 NA's :25979
The missing values are the same quantity for all the numerical variables. Morevore we hace checked they are missing in the same rows. We take a look at how they are distributed. In particuar at their daily quantity.
#create a table with the missing values
DATA=data[which(is.na(data$Global_reactive_power) == TRUE),]
t = table(DATA$Date)
print(t)
2006-12-21 2006-12-30 2007-01-14 2007-01-28 2007-02-22 2007-03-25 2007-04-28
2 2 1 1 2 1 1359
2007-04-29 2007-04-30 2007-06-01 2007-06-06 2007-06-09 2007-06-19 2007-06-29
1440 924 1 1 38 2 1
2007-07-15 2007-07-22 2007-08-01 2007-08-24 2007-09-26 2007-10-23 2007-11-21
130 1 21 1 2 2 1
2007-11-29 2007-12-17 2008-01-13 2008-02-02 2008-02-23 2008-03-24 2008-05-16
1 1 1 1 2 1 2
2008-06-13 2008-07-13 2008-08-04 2008-08-31 2008-10-25 2008-11-10 2008-11-12
1 2 1 1 43 6 2
2008-11-23 2008-12-10 2008-12-20 2009-01-14 2009-02-01 2009-02-14 2009-02-17
1 70 1 1 38 2 24
2009-03-01 2009-03-16 2009-04-13 2009-05-10 2009-05-26 2009-06-13 2009-06-14
1 1 2 1 3 1350 1440
2009-06-15 2009-07-10 2009-08-13 2009-09-13 2009-09-30 2009-10-11 2009-11-09
515 4 891 1 2 1 1
2009-12-10 2010-01-02 2010-01-12 2010-01-13 2010-01-14 2010-01-23 2010-02-10
2 1 547 1440 1142 1 1
2010-02-14 2010-03-20 2010-03-21 2010-04-11 2010-05-13 2010-06-12 2010-06-29
1 1208 819 1 1 1 1
2010-07-15 2010-08-17 2010-08-18 2010-08-19 2010-08-20 2010-08-21 2010-08-22
1 118 1440 1440 1440 1440 1348
2010-09-25 2010-09-26 2010-09-27 2010-09-28 2010-10-24
1144 1440 1440 1213 1
We see that there are consecutive days where we have null values for all the minutes in that days. We can then suppose that during that time the family was away and turned off the electricity. We can then substitute these NA with 0, cause it represent the real “consumption of energy” during that days. There are then days where there are very few null values, we can suppose that this is due to temporary malfunction of the machine. We can then substitute these NA with the previous non-null value.
# I create a new dataset where I am going to apply the substitution.
mydata = data
#create a list with the dates where I want to substitute the null values with 0.
l = c()
for (i in 1:length(row.names(t)))
{
if (t[i]>=178)
{l = c(l, row.names(t)[i])}
}
a = which( (as.character(mydata$Date) %in% l) & is.na(mydata$Global_active_power) )
mydata[a,3:9] = c(0,0,0,0,0,0,0)
# For the remaining days we substitute the null values with the previous no-null value
for (i in 3:9)
{
mydata[,i] = na_locf(mydata[,i], option='locf')
}
We create a new column with the energy that is used in the house but not in the devices connected to the submetering 1,2 and 3. In this way we have divided the total electricity used by the familiy in four different sources.
#create a column with energy not used in submetering 1,2,3
mydata$other = as.numeric(mydata$Global_active_power-mydata$Sub_metering_1-mydata$Sub_metering_2-mydata$Sub_metering_3)
We made different plots to understand which useful information can be extracted from the device using different aggregation of time. Here in the notebook you can see some examples.
We look at the energy consuption per minute in the different submetering. we would like to show how it is possible to identify some electrical appliances. Indeed, every appliance has a specific pattern in the consumption of energy. This information can be used to check if there are malfunctions or if the device is “too old” and it consuming more energy then expected. For example, in the following graph you can see the consumption of energy of the dishwasher. Thanks to this level of details we have in the data, we are able to calculate the energy used for washing the dishes.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered[c(1200:1440),],aes(x = c(1200:1440), y = Sub_metering_1))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]), "dishwasher")+xlab('minutes')+
ylab('Sub metering 1')+theme_minimal()
print(g)
If we look at the submetering two, we can see a pattern that keeps repearing in the consumption of energy. This is related with the fridge. Indeed, this appliance needs always energy at regular intervals. It is interesting that if a family monitor the energy real time, it could be possible to know immediately if the fridge remains open for too long. This can represent a useful application of the submetering devices.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:1440), y = Sub_metering_2))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]), "fridge")+xlab('minutes')+
ylab('Sub metering 2')+theme_minimal()
print(g)
If we look at the energy used that it is not tracked by the submetering devices, we can see that it more difficult to capture the behaviour of the single appliances. Nevertheless, it is possible to check when the peaks of energy occur during the days and this can be a useful information to monitor that noting unexpected has happened.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:1440), y = other))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]))+xlab('minutes')+
ylab('Other consumption')+theme_minimal()
print(g)
The consumption of energy aggregated by hour it is an indication of when the family is more “active” at home. This information can be used by the family to distribute better the consumption of energy in order to save money. Indeed, the electricity can have a different price depending on which hour is used.
# First we aggregate the energy by hour.
mydatah= mydata[,2:ncol(mydata)]
mydatah$hour = format( mydata$DateTime, format="%H")
mydatahours = data.frame(mydatah %>% group_by(Date, hour ) %>% summarise_all(sum))
# We plot the energy consumption in one specific day
dayconsidered = subset(mydatahours, day(Date)== 5& month(Date)==3 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:24), y = Global_active_power))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]))+xlab("hour")+theme_minimal()
print(g)
We aggregate the energy used in a mounth and we look at how the energy consumption change during the months.
# We aggregate the data
mydatamonth = mydata
mydatamonth$monthyear = format(mydata$Date, "%Y-%m")
mydatamonth$DateTime = NULL
mydatamonth$Date = NULL
mydatamonths = mydatamonth %>% group_by(monthyear) %>% summarize_all(sum)
mydatamonths$year = as.factor(substring(mydatamonths$monthyear,1,4))
mydatamonths$month = as.factor(substring(mydatamonths$monthyear,6,8))
First we look at the global energy used in the different months. It is immediately clear that the energy consumption change with the seasons. Indeed we reach the peak during the winter and the trough during summer.
ggplot(data=mydatamonths[c(2:48),], aes(x=month, y= Global_active_power, group= year)) +
geom_line(aes(color= year))+
geom_point()+
theme(legend.position="top")+labs(y= "Global active energy", x = "Month", color=' Years')+theme_minimal()
We can then look how much of the energy is used during the year for the different group of appliance that are connected to the different submetering devices.
# First we group the enrgy by year
mydatayears = data.frame(mydatamonths[,2: (ncol(mydatamonths)-1)] %>% group_by(year) %>% summarize_all(sum))
# Then we can plot the pie chart for one year.
ds <- data.frame(labels = c("Sub-metering 1", "Sub-metering 2", "Submetering 3", 'Other energy used'), values = unlist(mydatayears[4,c(6,7,8,9)]))
plot_ly(ds, labels = ~labels, values = ~values) %>%add_pie() %>% layout(title = "2009")
Finally, we can see how during the year the consummption of electricity changes in the different submetering.
plot_ly(mydatamonths %>% filter(year==2009), x = ~month, y = ~Sub_metering_1, type = 'bar', name = 'Sub metering 1') %>%
add_trace(y = ~Sub_metering_2, name = 'Sub metering 2') %>%
add_trace(y = ~Sub_metering_3, name = 'Sub metering 3') %>%
add_trace(y =~other, name = 'Other energy used') %>%
layout(yaxis = list(title = 'Energy'),xaxis = list(title = 'Month'), title = "2009" , barmode = 'stack')
When we have to forecast, we use a different approach with the null values. Indeed, since the holidays of the family were not always in the same period, we decide to fill that missing values using the usual consumption of energy in a similar period (instead of substituting them with 0). For shorter period of missing values, we consider the previous non null value.
energy = data
energy$Voltage <-NULL
energy$Global_intensity <-NULL
# As before we look at number of null values per day
DATA=data[which(is.na(data$Global_reactive_power) == TRUE),]
t = table(DATA$Date)
# In the day where we have few missing values we use the previous non null value
mv= c()
for (i in 1:length(t))
{
if (t[i] < 178 )
{mv = c(mv, row.names(t)[i])}
}
b = which( (as.character(energy$Date) %in% mv) )
for(i in 3:7)
{
energy[b,i] = na_locf(energy[b,i], option='locf')
}
# For the rest of the missing values, we consider all the values of energy, that were registered in the different years in the same season, weekday and minute, and we subtitute a random value from them.
energy$season = 'winter'
energy$season[month(energy$DateTime)>2 & month(energy$DateTime) <6 ]='spring'
energy$season[month(energy$DateTime)>5 & month(energy$DateTime) <9 ]='summer'
energy$season[month(energy$DateTime)>8 & month(energy$DateTime) <12 ]='autumn'
energy$season = as.character(energy$season)
energy$weekdays = weekdays(energy$DateTime)
ll= unique(energy$season)
l = unique(energy$weekdays)
for (i in ll){
for (j in 0:23){
for (k in 1:length(l))
{
index = which( hour(energy$DateTime) == j & as.character(energy$weekdays)== l[k] & as.character(energy$season)==i )
for (m in 3:7)
{energy[index,m] = as.numeric(impute(energy[index,m],'random'))}
}}}
After have examinated the data with different granularities (days, weekdays,etc.), we have decided to concentrate in predicting the energy by month. Indeed there is clearly a seasonality in this case. In the following code we show the forecast we have obtained for the global energy using two well known methods for time series forecasting.
To work with time series forecasting, first we have to create a timeseries object with the right frequency. Then we need to take care of the last observation (November 2010) since we are missing 4 days of that month in the data. We consider then the average daily consumption during that month and we add the “missing energy” to the last value of the time serie.
# Create a time serie object
energymonths = data.frame(energy %>% mutate(monthyear = format(mydata$Date, "%Y-%m")) %>% select(monthyear, Global_active_power) %>% group_by(monthyear) %>% summarize_all(sum))
energymonths = ts(energymonths[c(2:48),2], start=2007, frequency = 12)
# Adding the energy of 4 missing days of November 2010
December2010 = energy %>% filter(year(Date)== 2010, month(Date) == 11) %>% select(Date, Global_active_power) %>% group_by(Date) %>% summarize_all(sum)
meanconsumption = mean(December2010$Global_active_power)
energymonths[47] = energymonths[47] + 4*meanconsumption
# Create the training and the test
training = subset(energymonths, start = 1 ,end = 36)
testing= subset(energymonths, start= 37, end= 47)
In the following graph, you can see the time serie we are analysing.
autoplot(energymonths)+theme_minimal()
We can see that we have two outliers in the time series, respectively December 2007 and August 2008. We decide to substitute the value of August with the mean between the previous and following August. We will leave December 2007 as it is, since we have just three observations in the same month (for August 2008 we have four).
# We check the outliers also with an automatic function
ouliers<- tsoutliers(energymonths)
# We substitute the August 2008
energymonths[20] = 0.5*(energymonths[8]+energymonths[32])
We recheck the time series obtained.
autoplot(energymonths)+theme_minimal()
We create the traning and the test. For the training we will consider the first three years of data, for the test the last eleven months of the last year.
# Create the training and the test
training = subset(energymonths, start = 1 ,end = 36)
testing= subset(energymonths, start= 37, end= 47)
Looking at the time serie, we know that the prediction will be based on the seasonal component.
First we look at an exponential smoothing models (Holt-Winters models). In the following image we can see the results we obtain in the training using the automatic search. In particular the values for the parameters alpha and gamma, and the distribution of the residuals.
# We look at the results obtained by the default method
fit_ets= ets(training)
summary(fit_ets)
ETS(M,N,M)
Call:
ets(y = training)
Smoothing parameters:
alpha = 1e-04
gamma = 1e-04
Initial states:
l = 812173.0479
s = 1.3161 1.2009 1.0315 0.8784 0.657 0.6484
0.8051 0.9385 0.9447 1.1591 1.0936 1.3267
sigma: 0.0839
AIC AICc BIC
940.7328 964.7328 964.4856
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -7828.695 52846.78 41746.67 -1.49427 5.381002 0.6478047
ACF1
Training set 0.1486854
As expected, the model suggested doesn’t have a trend component but just a seasonal one which is multiplicative. The small value we obtain for the seasonality means that we give importance to all the previous observation during the same month. We do the following test to look at the residuals.
checkresiduals(fit_ets)
Ljung-Box test
data: Residuals from ETS(M,N,M)
Q* = 24.53, df = 3, p-value = 1.936e-05
Model df: 14. Total lags used: 17
There were 50 or more warnings (use warnings() to see the first 50)
The p-value in the Ljung-Box test suggest that the residuals are not uncorrelate. This means that there could be some information in the time series that we are not capturing. Finally, We look at the error on the test.
forecast1 = forecast(fit_ets,h = 11)
a1= accuracy(as.numeric(testing), forecast1$mean)
print(a2)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -8575.367 62429.64 53915.02 -1.992342 7.782517 0.08650557 0.4208285
We plot the forecasting in the test versus the actual values.
autoplot( energymonths)+autolayer(forecast1, PI = F)+theme_minimal()
We try to use also the additive method for the seasonality, so the model (M,N,A).
fit_ets_additive = ets(training, model = "MNA")
summary(fit_ets_additive)
ETS(M,N,A)
Call:
ets(y = training, model = "MNA")
Smoothing parameters:
alpha = 0.1644
gamma = 1e-04
Initial states:
l = 813834.1728
s = 266616.5 152973.6 25876.89 -101962.1 -261276.6 -270222.2
-146805.7 -52672.78 2385.092 109901.9 30975.14 244210.2
sigma: 0.089
AIC AICc BIC
944.9158 968.9158 968.6686
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -5480.248 58414.76 39359.19 -1.464873 5.198394 0.610757
ACF1
Training set 0.1194973
checkresiduals(fit_ets_additive)
Ljung-Box test
data: Residuals from ETS(M,N,A)
Q* = 19.183, df = 3, p-value = 0.0002506
Model df: 14. Total lags used: 17
forecast2 = forecast(fit_ets_additive,h = 11)
a2= accuracy(as.numeric(testing), forecast2$mean)
print(a2)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -15883.18 63236.82 58016.64 -2.172927 7.873018 -0.1587469 0.5125032
We obtain worst results in the test respect to the previous model, as one can see looking at the RMSE, MAE and MAPE.
We do a similar study with the ARIMA model. We first look at the parameters given by the automatic function.
fit.arima = auto.arima(training)
summary(fit.arima)
Series: training
ARIMA(0,0,0)(0,1,0)[12]
sigma^2 estimated as 7.713e+09: log likelihood=-307.25
AIC=616.5 AICc=616.68 BIC=617.68
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -9846.649 71709.21 43234.1 -1.468423 5.774239 0.6708859
ACF1
Training set 0.1347819
The arima model found is the one that consider just to differenciate by year (lag 12). It means that the forecast will be the naive one, considering the observation of the same month in the previous year.
We study the residuals.
checkresiduals(fit.arima)
Ljung-Box test
data: Residuals from ARIMA(0,0,0)(0,1,0)[12]
Q* = 2.8616, df = 7, p-value = 0.8975
Model df: 0. Total lags used: 7
This time the test suggests that the residuals are not correlated. We look at the error on the test.
forecast3 = forecast(fit.arima,h = 11)
a3= accuracy(as.numeric(testing), forecast2$mean)
print(a3)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -15883.18 63236.82 58016.64 -2.172927 7.873018 -0.1587469 0.5125032
We plot the forecast in the testing, that it is simply the previous observation taken during the same month.
autoplot( energymonths)+autolayer(forecast2, PI = F)+theme_minimal()
Comparing the results we have obtained on the test, we have seen that the best model is ETS(M,N,M). We can then forecast using this model the energy consumption of the family for 2011.
model = ets(energymonths, model = "MNM")
forecast4 = forecast(model, h = 12)
autoplot( energymonths)+autolayer(forecast4, PI = T)+theme_minimal()